Abstract

Deception detection has been receiving an increasing amount of attention
from the computational linguistics, speech, and multimodal processing
communities. One of the major challenges encountered in this task is the
availability of data, and most of the research work to date has been
conducted on acted or artificially collected data. The generated deception
models are thus lacking real-world evidence. In this paper, we explore the
use of multimodal real-life data for the task of deception detection. We
develop a new deception dataset consisting of videos from real-life
scenarios, and build deception tools relying on verbal and nonverbal
features. We achieve classification accuracies in the range of 77-82% when
using a model that extracts and fuses features from the linguistic and
visual modalities. We show that these results outperform the human
capability of identifying deceit.

1 Introduction 

As deceptive behavior occurs on a daily basis in different areas of life
(Meyer, 2010; Smith et al., 2014), the need arises for automated
methodologies to detect deception in an efficient, yet reliable manner.
There are many applications that can benefit from automatic deception
identification, such as airport security screening, crime investigation
and interrogation, interviews, advertisement, and others. In many of these
settings, the polygraph test has been used as the main method to identify
deceptive behavior. However, this method requires the use of skin-contact
devices and human expertise, making it infeasible for large-scale
applications. Moreover, polygraph tests were shown to be misleading in
multiple cases (Vrij, 2001; Gannon et al., 2009), as human judgment is
often biased.


Given the difficulties associated with the use of polygraph-like methods,
learning-based approaches have been proposed to address the deception
detection task using a number of modalities, including text (Feng et al.,
2012) and speech (Hirschberg et al., 2005; Newman et al., 2003). Unlike the
polygraph methods, learning-based methods for deception detection rely
mainly on data collected from deceivers and truth-tellers. The data is
usually elicited from human contributors, in a lab setting or via
crowdsourcing. An important problem identified in this data-driven research
is the lack of real data. Because of the artificial setting, the subjects
may not be emotionally aroused, as they may not take the experiments
seriously given the lack of motivation and/or penalty.

In this paper, we describe what we believe is a first attempt at building a
multimodal system that detects deception in real-life settings. We collect
a dataset consisting of 118 deceptive and truthful video clips, from real
trials and live street interviews aired in television shows. We use the
transcription of these videos to extract several linguistic features, and
we manually annotate the videos for the presence of several gestures that
are used to extract nonverbal features. We then build a system that jointly
uses the verbal and nonverbal modalities to automatically detect the
presence of deception. Our experiments show that the multimodal system can
identify deception with an accuracy in the range of 77-82%, significantly
improving over the baseline. In addition, we present a study on the human
ability to detect deception in single or multimodal data streams, and show
that our system outperforms humans on this task.

2 Dataset 

Our goal is to build a multimodal collection of occurrences of real
deception, which will allow us to analyze both verbal and nonverbal
behaviors in relation to deception.

Figure 1: Sample screenshots showing facial displays and hand gestures from
real-life deceptive and truthful clips. Starting at the top left-hand
corner: deceptive interview with up gaze (Up), deceptive interview with
side gaze (Side), deceptive trial with both hands (Both-H), truthful trial
with forward head (Forward), truthful interview with side turn (Side-Turn),
and truthful interview with single hand (Single-H).


Trials, truthful:
I was sentenced to forty to sixty years in prison for this crime that I
didn’t commit. At the trial the judge had exceeded the sentence guidelines
because he said I failed to show remorse. And I told him, you know, I felt
terrible for what happen to this woman, shouldn’t happen to anyone, but I
can’t show remorse for something I didn’t do.

Trials, deceptive:
We had some drinks at the bar, maybe one ... two. um I got onto the dance
floor myself as I explained, um I have been a trained dancer for some time,
going to be able to dance freely is like a ... release. I'm very much in my
own space when I do that and so I got up, and I was dancing alone on the
dance floor.

Interviews, truthful:
It’s difficult to pick just one but um I think Tender Mercies uh is ...
really captured my imagination um when I was in junior high. Had a lot to
do with Robert Duval’s performance certainly and that got me excited about
the possibility of um .... pulling off an acting career for myself.

Interviews, deceptive:
Yeah, yeah he was convincing as a wolf. Ahhh actually you know ahhh this is
like crazy I'm terrified from wolves, it’s my worst fear even though they
don’t exist but thats my worst fear, sharks and stuff like that. Yeah its
my worst fear, I am being honest with you.


Table 1: Sample transcripts for deceptive and truthful clips. The first row presents transcripts from the 
Trials domain while the second shows transcripts corresponding to the Interviews domain. 


2.1 Data Collection 

To collect real deception data, we start by identifying online multimedia
sources where deceptive behavior can be observed and verified. We
specifically target videos of people, on which we enforce some of the
constraints imposed by current data processing technologies: the person in
the video should be in front of the camera; the person’s face should be
clearly visible; visual quality should be clear enough to identify the
facial expressions; and finally, audio quality should be clear enough to
hear the voices and understand what the person is saying. We collect video
clips from public real trials and interviews aired during television shows,
where the truth or falsehood of the participant’s statements ends up being
known. Video clips from trials consist of statements from witnesses and
defendants in the same trial. In order to have a clear distinction between
deceptive and truthful trial videos portraying defendants, the process of
labeling the trial relies on the verdict. Thus, clips with a guilty verdict
are considered deceptive whereas clips with a not-guilty verdict or
exoneration are labeled as truthful. Clips containing witness testimonies
are labeled as truthful if their statements are verified by police
investigations. Examples of trials included in our dataset are Jodi Arias,
Andrea Sneiderman, and Amanda Hayes. Exonerees’ statements were taken from
“The Innocence Project” (http://www.innocenceproject.org).

Deceptive and truthful responses are also collected from TV shows and
interviews. Examples of such shows are “Lie Witness,” “Golden Balls,” and
the “American Film Institute” and “RevYOU” YouTube channels. Deceptive
videos portray scenarios where interviewees’ responses were known to be a
lie. For example, the interviewer asks a random individual for an opinion
on a non-existent film, and the interviewee fabricates a story. On the
other hand, truthful videos are collected from individuals asked for their
opinions on real movies.

Given our goals and constraints, data collection ended up being a lengthy
and laborious process consisting of several iterations of Web mining, data
processing and analysis, and content validation.

The final dataset includes 118 videos, including 59 that are labeled as
deceptive and 59 labeled as truthful. Among them, 62 belong to the TV
street interviews and shows category (Interviews) with 28 deceptive and 34
truthful video clips, and 56 belong to the trials category (Trials) with 31
deceptive and 25 truthful clips. The average length of the videos in the
dataset is 27.28 seconds, with an average length of 33.02 seconds for the
truthful clips and 21.54 seconds for the deceptive clips. Collected trial
samples cover famous murder cases, while street interviews cover several
topics such as movies, music, politics, and religion. The dataset contains
23 unique female and 39 unique male speakers, with their ages ranging
approximately between 16 and 60 years.

2.2 Transcriptions and Nonverbal Behavior 
Annotations 

Our goal is to analyze both verbal and nonverbal 
behavior to understand their relation to deception. 

First, all the video clips were manually transcribed. The transcription was
performed by two transcribers using the Elan software (Wittenburg et al.,
2006). We asked transcribers to include word repetitions and fillers such
as um, ah, and uh, as well as long pauses that were marked using three
consecutive dots. The final set of transcriptions contains 7835 words, with
an average of 66 words per transcript. Table 1 shows transcriptions of
sample deceptive and truthful statements from both trials and reality
shows.

Second, we annotate the gestures¹ observed during the interactions in the
video clips. We specifically focus on the annotation of facial displays and
hand movements, as they have been previously found to correlate with
deceptive behavior (Depaulo et al., 2003). The gesture annotation is
performed using the MUMIN coding scheme (Allwood et al., 2007).

¹ As done in the Human-Computer Interaction community, we use the term
“gesture” to broadly refer to body movements, including facial expressions
and hand gestures.

Gesture Category      Agreement   Kappa
Facial Expressions    72.88%      0.576
Eyebrows              80.51%      0.656
Eyes                  68.64%      0.517
Gaze                  61.40%      0.432
Mouth Openness        77.97%      0.361
Mouth Lips            82.20%      0.684
Head Movements        55.08%      0.420
Hand Movements        91.53%      0.858
Hand Trajectory       84.75%      0.753
Average               75.00%      0.584

Table 2: Gesture annotation agreement

In the MUMIN scheme, facial displays consist 
of several different facial expressions associated 
with eyebrows, eyes, gaze, and mouth. Smile, 
laughter, and scowl are also included, as well as 
general head and hand movements. 

The multimodal annotation was performed by two annotators using the Elan
software (Wittenburg et al., 2006). We decided to perform the gesture
annotations at video level, rather than at utterance level, because the
overall judgment of truthfulness and deceitfulness is based on the whole
video content. During the annotation process, annotators were allowed to
watch each video clip as many times as they needed. They were asked to
identify the facial displays and hand gestures that were most frequently
observed or dominating during the entire clip duration. For each video
clip, the annotators had to choose one label for each of the nine gestures
listed in Table 3.

Table 3 shows the frequency counts associated with the nine gestures
considered during the annotation. Note that the counts under each gesture
add up to 118, reflecting the fact that for every gesture, the annotators
had to choose one label for every video clip. When none of the labels
applied, the “Other” category was selected. In the case of gestures
associated with hand movements, the “Other” label also accounted for those
cases where the speaker’s hands were not moving or were not visible.

Gesture                      Label (description)                Count
General Facial Expressions   Smile                                 41
                             Scowl                                 13
                             Laugh (Laughter)                       1
                             Other                                 63
Eyebrows                     Frown (Frowning)                      17
                             Raise (Raising)                       71
                             Other                                 30
Eyes                         X-open (Exaggerated opening)          17
                             Close-BE (Closing both)                7
                             Closing-E (Closing one)                1
                             Close-R (Closing repeated)            20
                             Other                                 73
Gaze                         Interlocutor                          69
                             Up                                     7
                             Down                                  14
                             Side                                  24
                             Other                                  4
Mouth Openness               Close-M (Closed mouth)                26
                             Open-M (Open mouth)                   92
Mouth Lips                   Up-C (Corners up)                     61
                             Down-C (Corners down)                 51
                             Protruded                              1
                             Retracted                              5
Head Movements               Down (Single nod)                      3
                             Down-R (Repeated nods)                48
                             Forward (Move forward)                 3
                             Back (Move backward)                   3
                             Side-tilt (Single tilt)                8
                             Side-Tilt-R (Repeated tilts)           9
                             Side-Turn                              9
                             Side-Turn-R (Shake repeated)          26
                             Waggle                                 3
                             Other                                  6
Hand Movements               Both hands (Both-H)                   31
                             Single hands (Single-H)               26
                             Other                                 61
Hand Trajectory              Up (Upwards)                          13
                             Down (Downwards)                       5
                             Sideways                               5
                             Complex                               33
                             Other                                 62

Table 3: Frequency counts for nine facial displays and hand gestures

After all the video clips were annotated for gestures, the inter-annotator
agreement was measured. Table 2 shows the observed annotation agreement
between the two annotators, along with the Kappa statistic. The agreement
measure represents the percentage of times the two annotators agreed on the
same label for each gesture category. For instance, 72.88% of the time the
annotators agreed on the label assigned to the Facial Expressions category.
On average, the observed agreement was measured at 75%, with a Kappa of
0.58 (macro-averaged over the nine categories), which reflects moderate
agreement. Observed agreement for Head Movements and Gaze is noticeably
lower than for other categories, which can be attributed to a higher number
of available gesture choices, as seen in Table 3.
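The two quantities reported in Table 2 can be illustrated with a short
sketch. The text does not name the Kappa variant; since there are two
annotators, the code below assumes Cohen's kappa. The label lists are
invented toy data, not annotations from the dataset.

```python
from collections import Counter


def observed_agreement(labels_a, labels_b):
    """Fraction of clips on which both annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = observed_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability that both pick the same label at random,
    # given each annotator's own label distribution.
    p_chance = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical annotations for one gesture category over six clips.
annotator_1 = ["Smile", "Smile", "Other", "Scowl", "Other", "Smile"]
annotator_2 = ["Smile", "Other", "Other", "Scowl", "Other", "Smile"]

print(f"agreement = {observed_agreement(annotator_1, annotator_2):.2%}")
print(f"kappa     = {cohen_kappa(annotator_1, annotator_2):.3f}")
```

Computing `cohen_kappa` once per gesture category and averaging the nine
values would give a macro-averaged figure of the kind reported above.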

3 Features of Verbal and Nonverbal 
Behaviors 

Given the multimodal nature of our dataset, we decided to focus on the
linguistic and gesture components. In this section, we describe the sets of
features extracted for each modality, which will then be used to build
classifiers of deception.

3.1 Verbal Features 

We implement three types of features, consisting of unigrams,
psycholinguistic features, and syntactic complexity features.

Unigrams. We extract unigrams derived from the bag-of-words representation
of the video transcripts. The unigram features are encoded as word
frequencies and include all the words present in the transcripts.


Psycholinguistic Features. The Linguistic Inquiry and Word Count (LIWC) is
a psycholinguistic lexicon that has been frequently used to incorporate
semantic and psychological information into linguistic analysis
(Pennebaker and Francis, 1999). It has been successfully used in previous
work on deception detection (Newman et al., 2003; Mihalcea and
Strapparava, 2009; Ott et al., 2011). We obtain features for each of the 80
psycholinguistic classes present in the lexicon by calculating the
percentage of words in the transcription belonging to each class.

Syntactic Complexity. We also extract features to measure the syntactic
complexity of the speech produced by the speakers in truthful and deceptive
clips. This set of features is motivated by previous research that has
suggested that deceivers’ speech has lower complexity (Depaulo et al.,
2003). We use the tool described in (Lu, 2010), which generates indexes of
syntactic complexity, including general complexity metrics, length of
production, and amount of coordination. The set of features consists of
fourteen indexes including statistics related to T-units, which are
linguistic units that include a main clause in addition to attached
subordinate clauses. T-unit analysis is extensively used to analyze
syntactic complexity in speech and written content. The set of features
includes the mean length of sentence, mean length of T-unit, mean length of
clause, clauses per sentence, verb phrases per T-unit, clauses per T-unit,
dependent clauses per clause, dependent clauses per T-unit, T-units per
sentence, complex T-unit ratio, coordinate phrases per T-unit, coordinate
phrases per clause, complex nominals per T-unit, and complex nominals per
clause.

Figure 2: Distribution of nonverbal features for deceptive and truthful groups
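The first two verbal feature types can be sketched in a few lines. This is
an illustration under stated assumptions, not the authors' pipeline: the
tokenizer regex is an assumption, and because LIWC is a licensed lexicon,
`TOY_LEXICON` is a hypothetical stand-in with two invented word lists. The
syntactic complexity indexes require a parser and are not sketched here.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase runs of letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")

# Hypothetical stand-in for a psycholinguistic lexicon (class -> word set).
TOY_LEXICON = {
    "Family": {"mother", "father", "family"},
    "Negemo": {"terrible", "fear", "terrified"},
}


def tokenize(transcript):
    """Split a transcript into lowercase word tokens, fillers included."""
    return TOKEN_RE.findall(transcript.lower())


def unigram_counts(transcript):
    """Bag-of-words representation: word -> frequency in the transcript."""
    return Counter(tokenize(transcript))


def class_percentages(transcript, lexicon):
    """Percentage of transcript tokens that belong to each lexicon class."""
    tokens = tokenize(transcript)
    if not tokens:
        return {name: 0.0 for name in lexicon}
    return {
        name: 100.0 * sum(token in words for token in tokens) / len(tokens)
        for name, words in lexicon.items()
    }


sample = "um my worst fear my family"
print(unigram_counts(sample))
print(class_percentages(sample, TOY_LEXICON))
```

Fillers such as "um" are kept as tokens on purpose, since the transcribers
were asked to include them.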

3.2 Nonverbal Features 

The nonverbal features are derived from the annotations performed using the
MUMIN coding scheme as described in Section 2.2. We create a binary feature
for each of the 40 available gesture labels. Each feature indicates the
presence of a gesture only if it is observed during the majority of the
interaction duration. The generated features represent nine different
gesture categories covering facial displays and hand movements.

Facial Displays. These are facial expressions or head movements displayed
by the speaker during the deceptive or truthful interaction. They include
all the behaviors listed in Table 3 under the General Facial Expressions,
Eyebrows, Eyes, Mouth Openness, Mouth Lips, and Head Movements.

Hand Gestures. The second broad category covers gestures made with the
hands, and it includes the Hand Movements and Hand Trajectory labels listed
in Table 3.
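Since each clip receives exactly one label per gesture category, the binary
features amount to a one-hot encoding per category. The sketch below shows
this for two of the nine categories; the label inventory is abbreviated from
Table 3, and the function names and the sample annotation are hypothetical.

```python
# Abbreviated label inventory (two of the nine categories from Table 3).
GESTURE_LABELS = {
    "General Facial Expressions": ["Smile", "Scowl", "Laugh", "Other"],
    "Hand Movements": ["Both-H", "Single-H", "Other"],
}


def feature_names(label_inventory):
    """One feature name per (category, label) pair, in a fixed order."""
    return [
        f"{category}={label}"
        for category, labels in label_inventory.items()
        for label in labels
    ]


def encode_clip(annotation, label_inventory):
    """Turn one clip's per-category labels into a 0/1 feature vector."""
    vector = []
    for category, labels in label_inventory.items():
        chosen = annotation[category]
        if chosen not in labels:
            raise ValueError(f"unknown label {chosen!r} for {category!r}")
        vector.extend(1 if label == chosen else 0 for label in labels)
    return vector


# Hypothetical annotation for a single video clip.
clip = {"General Facial Expressions": "Smile", "Hand Movements": "Other"}
print(feature_names(GESTURE_LABELS))
print(encode_clip(clip, GESTURE_LABELS))
```

With this encoding, exactly one feature is set per category, which matches
the one-label-per-gesture annotation protocol.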

4 Experiments 

Feature Set            SVM      DT       RF
Unigrams               69.49%   76.27%   67.79%
Psycholinguistic       53.38%   50.00%   66.10%
Syntactic Complexity   52.54%   62.71%   53.38%
Facial Displays        78.81%   74.57%   67.79%
Hand Gestures          59.32%   57.62%   57.62%
Unigr.+Facial Disp.    71.18%   70.33%   68.64%
All Verbal             65.25%   63.55%   57.62%
All Nonverbal          75.42%   68.64%   72.03%
All Features           77.11%   69.49%   73.72%

Table 4: Deception classifiers using individual and combined sets of verbal
and nonverbal features.

We start our experiments with an analysis of the nonverbal behaviors
occurring in deceptive and truthful videos. We compare the percentage of
each behavior as observed in each class. For instance, there is a total of
41 videos in the dataset that include the Smile feature (as shown in Table
3), out of which 12 are part of the deceptive set of 59 videos, and 29 are
part of the truthful set (again, of 59 videos). Hence, the percentages for
this feature are 20.33% in the deceptive class, and 49.15% in the truthful
class. Figure 2 shows the percentages of all the nonverbal features for
which we observe noticeable differences for the deceptive and truthful
groups. As the figure suggests, facial displays seem to help differentiate
between the deceptive and truthful conditions. For instance, we can observe
that truth-tellers smile (Smile) and blink more (Close-R). Interestingly,
deceivers seem to make more eye contact (Interlocutor gaze) and shake their
heads (Side-Turn-R) more frequently than truth-tellers. This agrees with
the findings in (Depaulo et al., 2003) that liars who are more motivated to
get away with their lies (e.g., in trials) are likely to increase their
eye-contact behavior.
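The per-class percentages can be computed directly from the clip-level
annotations. The sketch below rebuilds the Smile example from the counts
given in the text (12 of 59 deceptive clips, 29 of 59 truthful clips); the
per-clip lists are synthetic and only reproduce those totals.

```python
from collections import Counter


def label_rate_by_class(gesture_labels, truth_labels, gesture):
    """Percentage of clips in each class that carry the given gesture label."""
    totals = Counter(truth_labels)
    hits = Counter(
        truth for label, truth in zip(gesture_labels, truth_labels) if label == gesture
    )
    return {cls: 100.0 * hits[cls] / totals[cls] for cls in totals}


# Synthetic per-clip data matching the reported Smile counts.
gestures = ["Smile"] * 12 + ["Other"] * 47 + ["Smile"] * 29 + ["Other"] * 30
classes = ["deceptive"] * 59 + ["truthful"] * 59

rates = label_rate_by_class(gestures, classes, "Smile")
for cls, pct in rates.items():
    print(f"{cls}: {pct:.2f}%")
```

The same function can be applied to every label in Table 3 to find the
gestures whose rates differ most between the two classes.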

Motivated by these results, we proceed to conduct further experiments to
evaluate the performance of the extracted features using a machine learning
approach.

Feature Set         SVM
All                 77.11%
- Hand gestures     74.57%
- Facial displays   64.40%
- Syntactic         76.27%
- Semantic          72.03%
- Unigrams          73.72%

Table 5: Feature ablation study.


We run our learning experiments on the real-deception dataset introduced
earlier. Given the even distribution between deceptive and truthful clips,
the baseline on this dataset is 50%. For each video clip, we create feature
vectors formed by combinations of the verbal and nonverbal features
described in the previous section. We build deception classifiers using
three classification algorithms: Support Vector Machines (SVM), Decision
Trees (DT), and Random Forest (RF).² We run several comparative experiments
using leave-one-out cross-validation. Table 4 shows the accuracy figures
obtained by the three classifiers on the major feature groups described in
Section 3. As shown in this table, the facial displays classifier achieves
the highest accuracy among the individual classifiers, followed by the
unigrams classifier.

We also evaluate classifiers that rely on combined sets of features. The
nonverbal features clearly outperform the verbal features, and the
classifier that includes all the features improves over the classifiers
that rely on all the verbal features or all the nonverbal features.
Importantly, several of the classifiers improve significantly over the
baseline.
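The leave-one-out protocol itself is simple to sketch: each clip is held out
once, a model is trained on the remaining clips, and accuracy is the share
of held-out clips predicted correctly. The paper uses Weka's SVM, decision
tree, and random forest; the stand-in below is a nearest-centroid classifier,
chosen only because it fits in a few lines of standard-library Python. The
feature vectors and labels are invented.

```python
def nearest_centroid_fit(X, y):
    """Stand-in learner: compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def nearest_centroid_predict(centroids, row):
    """Predict the class whose centroid is closest in squared distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    # sorted() makes tie-breaking deterministic.
    return min(sorted(centroids), key=lambda label: sq_dist(centroids[label]))


def leave_one_out_accuracy(X, y, fit=nearest_centroid_fit, predict=nearest_centroid_predict):
    """Hold out each instance once, train on the rest, and score the held-out one."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = fit(train_X, train_y)
        correct += predict(model, X[i]) == y[i]
    return correct / len(X)


# Invented binary feature vectors for six clips.
X = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]]
y = ["deceptive", "deceptive", "deceptive", "truthful", "truthful", "truthful"]
print(f"LOO accuracy: {leave_one_out_accuracy(X, y):.2%}")
```

Any learner exposing a fit and a predict step can be passed in through the
`fit` and `predict` arguments, so the same loop would apply to the three
classifiers named above.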

4.1 Analysis of Feature Contribution 

To better understand the contribution of the different feature sets to the
overall classifier performance, we conduct an ablation study where we
remove one group of features at a time. Given that SVM had the best
performance in our initial set of experiments, we run all our analysis
experiments only using this classifier. Table 5 shows the accuracies
obtained when one feature group is removed and the deception classifier is
built using the remaining features. From this table, we can again observe
that Facial Displays contribute the most to the classifier performance,
while Syntactic Features show the lowest contribution.
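The ablation procedure can be written as a loop over feature groups. In the
sketch below, `evaluate` is a placeholder for whatever routine trains and
scores a classifier on a set of feature columns; here it is replaced by a
hypothetical scoring function with made-up per-column gains, so the numbers
are illustrative only.

```python
def ablation_study(feature_groups, evaluate):
    """Score the full feature set, then the set with each group removed.

    feature_groups: dict mapping group name -> list of column indices.
    evaluate: callable taking a list of column indices, returning a score.
    """
    all_columns = sorted(c for cols in feature_groups.values() for c in cols)
    results = {"All": evaluate(all_columns)}
    for name, cols in feature_groups.items():
        removed = set(cols)
        kept = [c for c in all_columns if c not in removed]
        results[f"- {name}"] = evaluate(kept)
    return results


def most_important_group(results):
    """Group whose removal lowers the score the most."""
    ablated = {name: score for name, score in results.items() if name != "All"}
    return min(ablated, key=ablated.get)


# Hypothetical setup: three columns in two groups, with invented gains.
GROUPS = {"Facial displays": [0, 1], "Hand gestures": [2]}
COLUMN_GAIN = {0: 0.20, 1: 0.00, 2: 0.05}


def toy_evaluate(columns):
    """Placeholder score: chance level plus the gain of each kept column."""
    return 0.5 + sum(COLUMN_GAIN[c] for c in columns)


results = ablation_study(GROUPS, toy_evaluate)
print(results)
print(most_important_group(results))
```

Reading the output the same way as Table 5, the group whose removal causes
the largest drop is the one contributing most to the classifier.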

² We use the implementation available in the Weka toolkit with the default
parameters.



Figure 3: Weights of top nonverbal features 

For a closer look at the contribution of individual features included in
the group of Facial Displays, we analyze the absolute values of the weights
assigned by the learning algorithm to the features in this group. Figure 3
shows the feature weights normalized with respect to the largest feature
weight. The most predictive features include the presence of side turns, up
gazes, blinking, and smiling, which we previously identified as possible
indicators of deception. This further confirms our initial hypothesis that
gestures associated with human interaction are an important component of
human deception.
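The normalization used for the weight plot is a division of each absolute
weight by the largest absolute weight. A minimal sketch follows; the weight
values are invented and do not come from the trained classifier.

```python
def normalized_weights(weights):
    """Rank features by |weight| scaled so that the largest becomes 1.0."""
    magnitudes = {name: abs(w) for name, w in weights.items()}
    largest = max(magnitudes.values())
    if largest == 0:
        raise ValueError("all weights are zero; nothing to normalize")
    ranked = sorted(magnitudes.items(), key=lambda item: item[1], reverse=True)
    return {name: magnitude / largest for name, magnitude in ranked}


# Hypothetical raw weights; the sign is discarded by the absolute value.
raw = {"Smile": 0.2, "Side-Turn-R": -0.8, "Up": 0.4}
for name, value in normalized_weights(raw).items():
    print(f"{name:12s} {value:.2f}")
```

Taking absolute values means the ranking shows how strongly a feature
influences the decision, not which class it favors.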

We also analyze the contribution of the linguistic features. Using the
linguistic ethnography method (Mihalcea and Pulman, 2009), we obtain the
most dominant LIWC word classes associated with deceptive and truthful
transcripts extracted from trial and interview clips. Results are shown in
Table 6. Interestingly, the most dominant classes in truthful clips,
regardless of being from interviews or trials, correspond to words related
to Family, Home, and Humans. This suggests that truth-tellers show similar
word usage when interviewed in a real scenario. On the other hand, dominant
classes associated with deceivers are less consistent, as they depend on
the topic being discussed. For instance, while being interviewed about a
non-existent movie, deceivers talk about their Past, Assent, and use Motion
words in order to support their lies. In contrast, while being on trial
stating their (false) innocence, they use Anxiety, Anger, and negative
emotion words (class Negemo). In line with earlier observations (Mihalcea
and Strapparava, 2009), deceptive texts include more words that reflect
certainty (class Certain, with words such as completely, truly, always) and
more references to others (class Other, with words such as she, day, him).

Truthful
Interviews             Trials
Class        Score     Class      Score
Metaphor     2.98      You        3.99
Money        2.74      Family     3.07
Inhibition   2.74      Home       2.45
Home         2.13      Humans     1.87
Humans       2.02      Posemo     1.81
Family       1.96      Insight    1.64

Deceptive
Interviews             Trials
Class        Score     Class      Score
Assent       4.81      Anger      2.61
Past         2.59      Anxiety    2.61
Sexual       2.00      Certain    2.28
Other        1.87      Death      1.96
Motion       1.68      Physical   1.77
Negemo       1.44      Negemo     1.52

Table 6: LIWC word classes most strongly associated with deception and
truth.

4.2 Domain Experiments 

We perform three sets of experiments to determine the role played by the
domain. The first set of experiments uses only the Interviews video clips
(62 in total), and the results are shown in the left column of Table 7. The
second set uses only the Trials instances (56 in total), with results shown
in the right column of Table 7. Finally, we also perform cross-domain
experiments, with the training data drawn from one domain and the test data
from the other. The results of these experiments are shown in Table 8.
Given the uneven distribution of the truthful and deceptive video clips in
the two domains, the baselines are 54.83% for the Interviews domain (34
truthful, 28 deceptive), and 55.35% for the Trials domain (25 truthful, 31
deceptive).
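These baselines correspond to always predicting the majority class. A small
sketch using the clip counts given above (the function name is
illustrative):

```python
def majority_baseline(n_truthful, n_deceptive):
    """Accuracy (in percent) of always predicting the larger class."""
    total = n_truthful + n_deceptive
    return 100.0 * max(n_truthful, n_deceptive) / total


interviews = majority_baseline(n_truthful=34, n_deceptive=28)
trials = majority_baseline(n_truthful=25, n_deceptive=31)
full_dataset = majority_baseline(n_truthful=59, n_deceptive=59)

print(f"Interviews:   {interviews:.4f}%")
print(f"Trials:       {trials:.4f}%")
print(f"Full dataset: {full_dataset:.4f}%")
```

The values printed to four decimals agree with the figures in the text when
truncated to two decimals.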

Feature Set            Interviews   Trials
Baseline               54.83%       55.35%
Unigrams               75.80%       82.14%
Psycholinguistics      59.67%       50.00%
Syntactic Complexity   54.83%       60.71%
Facial Displays        70.96%       80.35%
Hand Gestures          56.45%       48.21%
Unigr.+Facial Disp.    70.96%       76.78%
All Verbal             70.96%       64.28%
All Nonverbal          67.14%       83.92%
All Features           79.03%       82.14%

Table 7: Deception classifiers for the Interviews and Trials domains, using
an SVM classifier trained on individual and combined sets of verbal and
nonverbal features.

Training     Test         SVM
Trials       Interviews   58.06%
Interviews   Trials       58.92%

Table 8: Cross-domain classification results using an SVM classifier
trained on all the features

What we learn from these experiments is that the domain does matter.
Despite the smaller dataset, the experiments run on one domain at a time
lead to results that are higher than the ones obtained with more data but
with a mix of domains. The cross-domain experiments also support this
argument, as the performance drops significantly when there is no overlap
in domain between the training and the test instances. Overall, in all our
machine learning experiments, the combined classifier that makes use of all
the verbal and nonverbal features achieves the best trade-off between
performance and robustness, as it always leads to the best or second best
performance across all the experiments using individual or combined feature
sets. While a classifier based on an individual feature set can sometimes
lead to a better performance (e.g., the Facial Displays classifier has
better performance when all the video clips are used), that same classifier
may not perform well in another setting (e.g., the Facial Displays
classifier is significantly below the All Features classifier in the domain
experiments).

5 Human Performance 

An important remaining question is concerned with the human performance on
the task of deception detection. An answer to this question can shed light
on the difficulty of the task, and can also place our results in
perspective.

We conduct a study where we evaluate the human ability to identify deceit
when exposed to four different modalities: Text, consisting of the language
transcript; Audio, consisting of the audio track of the clip; Silent video,
consisting of only the video with muted audio; and Full video, where audio
and video are played simultaneously. We create an annotation interface that
shows an annotator instances for each modality in random order, and ask him
or her to select a label of either “Deception” or “Truth” according to his
or her perception of truthfulness or falsehood.

Modality       Agreement   Kappa
Text           58.80%      0.047
Audio          66.70%      0.288
Silent video   52.00%      0.065
Full video     61.60%      0.191

Table 9: Agreement among three human annotators on text, audio, silent
video, and full video modalities.

       Text     Audio    Silent video   Full video
A1     54.24%   58.47%   50.85%         63.00%
A2     55.93%   67.80%   45.76%         68.00%
A3     65.25%   70.34%   55.93%         71.00%
Sys.   65.25%   NA       75.42%         77.11%

Table 10: Performance of three annotators and the developed automatic
system (Sys.) on the real-deception dataset over four modalities.

To avoid annotation bias, we show the modalities in the following order:
first we show either Text or Silent video, then we show Audio, followed by
Full video. Note that apart from this constraint, which is enforced over
the four modalities belonging to each video clip, the order in which
instances are presented to an annotator is random. Furthermore, the
annotators did not have access to any information that would reveal the
true label of an instance. The only exception to this could have been the
annotators’ previous knowledge of some of the public trials in our dataset.
A discussion with the annotators after the annotation took place indicated
however that this was not the case.

Three annotators labeled all the 118 video clips in the dataset. Since four
modalities were extracted from each video, each annotator annotated a total
of 472 instances. Annotators were not offered a monetary reward and we
considered their judgments to be honest as they participated voluntarily in
this experiment. Table 9 shows the observed agreement and Kappa statistics
among the three annotators for each modality.³ The agreement for most
modalities is rather low and the Kappa scores range from slight to fair
agreement. As noted before (Ott et al., 2011), this low agreement can be
interpreted as an indication that people are poor judges of deception.

³ Inter-rater agreement with multiple raters and variables.
https://mlnl.net/jg/software/ira/

We also determine each annotator’s performance for each modality. The
results, shown in Table 10, additionally support the argument that human
judges have difficulty performing the deception detection task. An
interesting, yet perhaps unsurprising observation is that the human
performance increases with the availability of modalities. The poorest
accuracy is obtained in Silent video, followed by Text, Audio, and Full
video, where the judges have the highest performance.

Overall, our study indicates that detecting deception is indeed a difficult
task for humans and further verifies previous findings where human ability
to spot liars was found to be slightly better than chance (Aamodt and
Custer, 2006). Moreover, the performance of the human annotators appears to
be significantly below that of our system.

6 Related Work 

Verbal Deception Detection. To date, several research publications on
verbal-based deception detection have explored the identification of
deceptive content in a variety of domains, including online dating websites
(Toma and Hancock, 2010; Guadagno et al., 2012), forums (Warkentin et al.,
2010; Joinson and Dietz-Uhler, 2002), social networks (Ho and Hollister,
2013), and consumer report websites (Ott et al., 2011; Li et al., 2014).
Research findings have shown the effectiveness of features derived from
text analysis, which frequently includes basic linguistic representations
such as n-grams and sentence count statistics (Mihalcea and Strapparava,
2009), and also more complex linguistic features derived from syntactic CFG
trees and part-of-speech tags (Feng et al., 2012; Xu and Zhao, 2012).
Research work has also relied on the LIWC lexicon to build deception models
using machine learning approaches (Mihalcea and Strapparava, 2009; Almela
et al., 2012) and showed that the use of psycholinguistic information is
helpful for the automatic identification of deceit. Following the
hypothesis that deceivers might create less complex sentences in an effort
to conceal the truth and to be able to recall their lies more easily,
several researchers have also studied the relation between text syntactic
complexity and deception (Yancheva and Rudzicz, 2013).

Nonverbal Deception Detection. Earlier ap¬ 
proaches to nonverbal deception detection relied 
on polygraph tests to detect deceptive behavior.
These tests are mainly based on physiological
features such as heart rate, respiration rate,
and skin temperature. Several studies (Vrij, 2001;
Gannon et al., 2009; Derksen, 2012) indicated
that relying solely on physiological measurements 
can be biased and misleading. Chittaranjan and
Hung (2010) created an audio-visual
recording of the “Are you a Werewolf?”
game in order to detect deceptive behavior using
non-verbal audio cues and to predict the
subjects’ decisions in the game. For hand gestures,
blob analysis was used to detect deceit by tracking
the hand movements of the subjects (Lu et al.,
2005; Tsechpenakis et al., 2005), or using geometric
features related to the hand and head motion
(Meservy et al., 2005). Caso et al. (2006)
identified particular hand gestures
that can be related to the act of deception using 
data from simulated interviews. Cohen et al. 
(2010) found that fewer iconic hand gestures were 
a sign of a deceptive narration, and Hillman et al. 
(2012) determined that increased speech prompt¬ 
ing gestures were associated with deception while 
increased rhythmic pulsing gestures were associ¬ 
ated with truthful behavior. Also related is the 
taxonomy of hand gestures developed by (Maricchiolo
et al., ) for deception and social behavior.
Facial expressions also played a critical role
in the identification of deception. Ekman (2001)
defined micro-expressions as relatively short in¬ 
voluntary expressions, which can be indicative of 
deceptive behavior. Moreover, these expressions 
were analyzed using smoothness and asymmetry 
measurements to further relate them to an act of 
deceit (Ekman, 2003). Tian et al. (2005)
considered features such as face orientation
and facial expression intensity. Owayjan et al.
(2012) extracted geometric-based
features from facial expressions, and Pfister
and Pietikainen (2012)
developed a micro-expression dataset to identify 
expressions that are clues for deception. Recently, 
features from different modalities were integrated 
in order to find a combination of multimodal fea¬ 
tures with superior performance (Burgoon et al., 
2009; Jensen et al., 2010). A multimodal decep¬ 
tion dataset consisting of linguistic, thermal, and 
physiological features was introduced in (Perez- 
Rosas et al., 2014), which was then used to de¬ 
velop a multimodal deception detection system 
(Abouelenien et al., 2014). An extensive review
of approaches for evaluating human credibility using
physiological, visual, acoustic, and linguistic
features is available in (Nunamaker et al., 2012). 

7 Conclusions 

In this paper we presented a study of multimodal 
deception detection using real-life occurrences of 
deceit. We introduced a novel dataset covering 
recordings from public real trials and street inter¬ 
views, and used this dataset to perform both qual¬ 
itative and quantitative experiments. Our analy¬ 
sis of nonverbal behaviors occurring in deceptive 
and truthful videos brought insight into the ges¬ 
tures that play a role in deception. We also built 
classifiers relying on individual or combined sets 
of verbal and nonverbal features, and showed that 
we can achieve accuracies in the range of 77-82%. 

Additional analyses showed the role played by 
the various feature sets used in the experiments, 
and the importance of the domain. To place our re¬ 
sults in perspective and better understand the dif¬ 
ficulty of the task, we performed a study of hu¬ 
man ability to detect deception, which revealed 
high disagreement among the annotators. Our au¬ 
tomatic system outperforms the human detection 
of deceit by 6-15%. 

To our knowledge, this is the first work to automatically
detect instances of deceit using both verbal
and nonverbal features extracted from real deception
data. In order to develop a fully automated
deception detection system, our future work will
address the use of automatic gesture and facial ex¬ 
pression identification and automated speech tran¬ 
scription. Our goal is to move forward towards a 
real-time deception detection system. 

The dataset introduced in this paper is publicly 
available from http://lit.eecs.umich.edu. 